Day 09. Writing a Crawler with Tests - Dependency Injection

2023 iThome Ironman Contest

DAY 9

Software Development

"Happily Writing PHPUnit" series, part 9

15th Ironman Contest, phpunit

recca0120

2023-09-24 21:48:38

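In the previous post we wrote the crawler using Extract Method. In this post we refactor the code with dependency injection instead.
So how do Extract Method and dependency injection differ? We will skip the heavy theory here and look at the steps we take during development.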

Extract Method

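Using Extract Method feels more like waterfall-style development: when we run into a part we cannot control, we move that code into a protected method and then override it. This gives us immediate feedback, but the object ends up with more than a single responsibility.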
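As a reminder of what that looks like, here is a minimal sketch of the Extract Method style. It is a reconstruction for illustration, not the previous post's actual code: the class names ExtractMethodCrawler and FakeExtractMethodCrawler, the method name fetch(), and the inline HTML are all assumptions.

```php
<?php

// Hypothetical Extract Method style crawler: one class both fetches and parses.
class ExtractMethodCrawler
{
    public function all()
    {
        // Parsing lives here...
        preg_match_all('/<a\sclass="board"[^>]*>.+?<\/a>/s', $this->fetch(), $matches);

        return $matches[0];
    }

    // ...and so does fetching, extracted only so that a test can override it.
    protected function fetch()
    {
        return file_get_contents('https://www.ptt.cc/bbs/hotboards.html');
    }
}

// The test double has to subclass the crawler itself to replace the network call.
class FakeExtractMethodCrawler extends ExtractMethodCrawler
{
    protected function fetch()
    {
        // Assumed sample markup, standing in for a saved fixture file.
        return '<a class="board" href="/bbs/Gossiping/index.html">Gossiping</a>';
    }
}

$rows = (new FakeExtractMethodCrawler())->all();
```

Note that the object under test is a subclass of the crawler, so the test no longer exercises exactly the class that ships, and the crawler still owns the HTTP call.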

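So what changes if we switch to dependency injection?

Dependency Injection

Before writing any code, we first need to think about which pieces of code this requirement calls for. I would split the requirement into two parts:

Fetching the web page
Parsing the web page

Once we know the requirement has two parts, it is natural to create two objects, one for each: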

HttpClient
Crawler (named PttCrawler in the code)

So our code ends up looking like this:

<?php
// tests/PttCrawlerTest.php

namespace Recca0120\Ithome30\Tests;

use PHPUnit\Framework\TestCase;
use Recca0120\Ithome30\PttCrawler;
use Recca0120\Ithome30\HttpClient;

class PttCrawlerTest extends TestCase
{
    public function test_fetch_board_page()
    {
        $crawler = new PttCrawler(new FakeHttpClient());
        $records = $crawler->all();

        self::assertEquals([
            'name' => 'Gossiping',
            'nuser' => '12185',
            'class' => '綜合',
            'title' => '[八卦]不停重複今日公祭明日忘記',
        ], $records[0]);
    }
}

// Test double: serves a saved copy of the page instead of hitting the network.
class FakeHttpClient extends HttpClient
{
    public function get()
    {
        return file_get_contents(__DIR__ . '/fixtures/ptt_home.html');
    }
}

<?php
// src/HttpClient.php

namespace Recca0120\Ithome30;

class HttpClient
{
    public function get()
    {
        return file_get_contents('https://www.ptt.cc/bbs/hotboards.html');
    }
}

<?php
// src/PttCrawler.php

namespace Recca0120\Ithome30;

class PttCrawler
{
    public function __construct(private HttpClient $httpClient)
    {
    }

    public function all()
    {
        return array_map(
            fn (string $row) => $this->parseCols($row),
            $this->parseRows($this->httpClient->get())
        );
    }

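    // Turn one board row into an array keyed by the suffix of each board-* class
    // (name, nuser, class, title), stripping tags and the &#9678; entity from values.
    private function parseCols($row)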
    {
        preg_match_all('/"board-(?<name>\w+)">(?<value>.+?)<\/div>/', $row, $matches);
        $cols = [];
        foreach (array_keys($matches[0]) as $index) {
            $name = $matches['name'][$index];
            $value = $matches['value'][$index];
            $cols[$name] = str_replace('&#9678;', '', strip_tags($value));
        }

        return $cols;
    }

    // Split the page into one HTML fragment per <a class="board"> element.
    private function parseRows($html)
    {
        preg_match_all('/<a\sclass="board"[^>]*>.+?<\/a>/s', $html, $matches);

        return $matches[0];
    }
}

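Now look at PttCrawlerTest again: the fake object has become FakeHttpClient, so from the test case alone we can tell that the part we cannot control is fetching the web page data. This improves the readability of the code once more, and more importantly, our objects now truly follow the single responsibility principle.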
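Because the only uncontrollable piece is now an injected object, a test can also hand the crawler any HTML it likes. The sketch below repeats condensed copies of the article's two classes so that it runs on its own (namespaces omitted), and swaps the fixture file for an anonymous class holding an inline string. The inline markup is an assumed approximation of one PTT board row, not a copy of the real page.

```php
<?php

// Condensed copy of the article's HttpClient.
class HttpClient
{
    public function get()
    {
        return file_get_contents('https://www.ptt.cc/bbs/hotboards.html');
    }
}

// Condensed copy of the article's PttCrawler.
class PttCrawler
{
    public function __construct(private HttpClient $httpClient)
    {
    }

    public function all()
    {
        return array_map(
            fn (string $row) => $this->parseCols($row),
            $this->parseRows($this->httpClient->get())
        );
    }

    private function parseCols($row)
    {
        preg_match_all('/"board-(?<name>\w+)">(?<value>.+?)<\/div>/', $row, $matches);
        $cols = [];
        foreach (array_keys($matches[0]) as $index) {
            $cols[$matches['name'][$index]] = str_replace('&#9678;', '', strip_tags($matches['value'][$index]));
        }

        return $cols;
    }

    private function parseRows($html)
    {
        preg_match_all('/<a\sclass="board"[^>]*>.+?<\/a>/s', $html, $matches);

        return $matches[0];
    }
}

// Assumed sample markup for a single board row (hypothetical fixture).
$html = <<<'HTML'
<a class="board" href="/bbs/Gossiping/index.html">
    <div class="board-name">Gossiping</div>
    <div class="board-nuser"><span class="hl f6">12185</span></div>
    <div class="board-class">綜合</div>
    <div class="board-title">&#9678;[八卦]不停重複今日公祭明日忘記</div>
</a>
HTML;

// Anonymous stub: an HttpClient that returns whatever HTML it was given.
$stub = new class($html) extends HttpClient {
    public function __construct(private string $html)
    {
    }

    public function get()
    {
        return $this->html;
    }
};

$records = (new PttCrawler($stub))->all();
print_r($records[0]);
```

With this fixture, $records[0] should hold the same four keys (name, nuser, class, title) that the article's test asserts on, and PttCrawler itself is used unmodified.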